Confidence-based Ensembles of End-to-End Speech Recognition Models
The number of end-to-end speech recognition models grows every year. These
models are often adapted to new domains or languages, resulting in a
proliferation of expert systems that achieve great results on target data
while generally showing inferior performance outside their domain of
expertise. We explore the combination of such experts via confidence-based
ensembles: ensembles of models in which only the output of the most-confident
model is used. We assume that the models' target data is unavailable except for a
small validation set. We demonstrate the effectiveness of our approach with two
applications. First, we show that a confidence-based ensemble of 5 monolingual
models outperforms a system where model selection is performed via a dedicated
language identification block. Second, we demonstrate that it is possible to
combine base and adapted models to achieve strong results on both original and
target data. We validate all our results on multiple datasets and model
architectures.

Comment: To appear in Proc. INTERSPEECH 2023, August 20-24, 2023, Dublin, Ireland
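The selection rule described in the abstract can be sketched in a few lines. This is a hypothetical illustration, not the paper's code: each expert model is assumed to return a transcript together with a scalar confidence (e.g. an aggregated token probability tuned on the small validation set), and the ensemble keeps only the most-confident output.

```python
# Hypothetical sketch of confidence-based ensemble selection: run every
# expert model and keep only the output of the most-confident one.

def confidence_ensemble(models, audio):
    """Return the transcript of whichever model is most confident on `audio`.

    `models` is a list of callables mapping audio -> (transcript, confidence),
    where confidence is a float (e.g. calibrated on a small validation set).
    """
    best_transcript, best_confidence = None, float("-inf")
    for model in models:
        transcript, confidence = model(audio)
        if confidence > best_confidence:
            best_transcript, best_confidence = transcript, confidence
    return best_transcript

# Toy stand-ins for two monolingual "expert" models.
english_model = lambda audio: ("hello world", 0.92)
german_model = lambda audio: ("hallo welt", 0.35)

print(confidence_ensemble([english_model, german_model], audio=None))
```

In the paper's first application, the callables would be five monolingual ASR models, and this argmax-over-confidence replaces a dedicated language-identification block.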
Damage Control During Domain Adaptation for Transducer Based Automatic Speech Recognition
Automatic speech recognition models are often adapted to improve their
accuracy in a new domain. A potential drawback of model adaptation to new
domains is catastrophic forgetting, where the Word Error Rate on the original
domain is significantly degraded. This paper addresses the setting in which we
want to simultaneously adapt automatic speech recognition models to a new
domain and limit the degradation of accuracy on the original domain without
access to the original training dataset. We propose several techniques such as
a limited training strategy and regularized adapter modules for the Transducer
encoder, prediction, and joiner networks. We apply these methods to the Google
Speech Commands and the UK and Ireland English Dialect speech datasets and
obtain strong results on the new target domain while limiting the degradation
on the original domain.

Comment: To appear in Proc. SLT 2022, Jan 09-12, 2023, Doha, Qatar
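A minimal sketch of one ingredient, the regularized adapter module, may help make the idea concrete. This is an assumed illustration, not the paper's implementation: a residual bottleneck adapter is inserted into the network, and an L2 penalty pulls its weights toward zero, i.e. toward the identity mapping of the unadapted base model, which limits forgetting on the original domain.

```python
# Illustrative (assumed) residual bottleneck adapter with an L2 penalty
# that keeps the adapted model close to the frozen base model.

def adapter(hidden, w_down, w_up):
    """Residual adapter: hidden + w_up @ relu(w_down @ hidden)."""
    bottleneck = [max(0.0, sum(w * h for w, h in zip(row, hidden)))
                  for row in w_down]
    update = [sum(w * b for w, b in zip(row, bottleneck)) for row in w_up]
    return [h + u for h, u in zip(hidden, update)]

def l2_penalty(w_down, w_up, strength=0.01):
    """Regularizer pushing adapter weights toward zero, i.e. toward the
    identity mapping of the unadapted base model."""
    return strength * sum(w * w for row in w_down + w_up for w in row)

# With zero adapter weights, the base model's activations pass through
# unchanged -- the adapted model starts identical to the original.
hidden = [1.0, -2.0, 0.5]
w_down = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # 2x3 down-projection (toy)
w_up = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]   # 3x2 up-projection (toy)
assert adapter(hidden, w_down, w_up) == hidden
```

In the paper's setting, such adapters sit in the Transducer encoder, prediction, and joiner networks, and the penalty term is added to the training loss.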
Text-only domain adaptation for end-to-end ASR using integrated text-to-mel-spectrogram generator
We propose an end-to-end Automatic Speech Recognition (ASR) system that can
be trained on transcribed speech data, text-only data, or a mixture of both.
The proposed model uses an integrated auxiliary block for text-based training.
This block combines a non-autoregressive multi-speaker text-to-mel-spectrogram
generator with a GAN-based enhancer to improve the spectrogram quality. The
proposed system can generate a mel-spectrogram dynamically during training. It
can be used to adapt the ASR model to a new domain by using text-only data from
this domain. We demonstrate that the proposed training method significantly
improves ASR accuracy compared to the system trained on transcribed speech
only. It also surpasses cascaded TTS systems with a vocoder in both adaptation
quality and training speed.

Comment: Accepted to INTERSPEECH 202
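The core training-loop idea, routing text-only batches through the integrated text-to-mel generator while transcribed speech uses real spectrograms, can be sketched as follows. All names here are hypothetical stand-ins, not the paper's API.

```python
# Schematic (assumed) sketch of mixed speech/text training: transcribed
# speech supplies real mel-spectrograms, while text-only batches are first
# passed through the integrated text-to-mel generator.

def fake_text_to_mel(text):
    """Stand-in for the non-autoregressive text-to-mel generator + GAN
    enhancer: one 80-dim mel frame per character (toy)."""
    return [[0.0] * 80 for _ in text]

def training_step(batch, asr_loss, text_to_mel=fake_text_to_mel):
    if batch.get("mel") is not None:        # transcribed speech data
        mel = batch["mel"]
    else:                                   # text-only data: synthesize mel
        mel = text_to_mel(batch["text"])
    return asr_loss(mel, batch["text"])

toy_loss = lambda mel, text: float(len(mel))  # placeholder loss
loss = training_step({"mel": None, "text": "hi"}, toy_loss)
```

Because the spectrogram is generated on the fly inside the training step, text-only domain data can adapt the ASR model without a separate cascaded TTS-plus-vocoder synthesis pass.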
A Chat About Boring Problems: Studying GPT-based text normalization
Text normalization - the conversion of text from written to spoken form - is
traditionally assumed to be an ill-formed task for language models. In this
work, we argue otherwise. We empirically show the capacity of Large Language
Models (LLMs) for text normalization in few-shot scenarios. Combining
self-consistency reasoning with linguistically informed prompt engineering, we find
that LLM-based text normalization achieves error rates around 40% lower than top
normalization systems. Further, upon error analysis, we note key limitations in
the conventional design of text normalization tasks. We create a new taxonomy
of text normalization errors and apply it to results from GPT-3.5-Turbo and
GPT-4.0. Through this new framework, we can identify strengths and weaknesses
of GPT-based TN, opening opportunities for future work.
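The self-consistency component mentioned above amounts to sampling several candidate spoken-form outputs and keeping the majority answer. The sketch below uses a toy stand-in for the LLM sampler (`sample_llm` is hypothetical, not a real API), with majority voting via `collections.Counter`.

```python
from collections import Counter

# Hedged sketch of self-consistency for text normalization: sample several
# candidate spoken-form outputs from an LLM (stubbed here) and keep the
# majority answer.

def sample_llm(written_text, seed):
    """Toy stand-in for a sampled LLM call: usually consistent, sometimes not."""
    candidates = ["three hundred and fifty dollars",
                  "three hundred and fifty dollars",
                  "three fifty dollars"]
    return candidates[seed % len(candidates)]

def self_consistent_normalize(written_text, n_samples=5):
    samples = [sample_llm(written_text, seed) for seed in range(n_samples)]
    return Counter(samples).most_common(1)[0][0]

print(self_consistent_normalize("$350"))  # majority answer wins
```

Majority voting suppresses the occasional inconsistent sample, which is the intuition behind combining self-consistency with few-shot prompting for this task.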